Digitized images of fine needle aspirates (FNA) of breast masses from 569 patients
Figure 2, Street (1993)
Outcome: clinical diagnosis (malignant or benign)
Predictors
radius (mean of distances from center to points on the perimeter)
texture (standard deviation of gray-scale values)
perimeter
area
smoothness (local variation in radius lengths)
compactness (perimeter^2 / area - 1.0)
concavity (severity of concave portions of the contour)
concave points (number of concave portions of the contour)
symmetry
fractal dimension (“coastline approximation” - 1)
For each predictor, the data include the mean, “standard error”, and worst (mean of the three largest values) measurements across the cell nuclei in an image.
About 35% of diagnoses were malignant
Task: Build a model to predict the probability that a mass is malignant given its cell characteristics.
Use the training data (‘breast_dx_train.csv’) and the validation data (‘breast_dx_validation.csv’) to build and select the model.
You can use logistic regression with any of the model building approaches we considered, or something else. Alternatively, you can build a classifier using a machine learning approach. Note that if your model only classifies observations (predicting 0 or 1 rather than a probability), your MSPE, Absolute, and 0-1 loss will all be the same.
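To see why the three losses coincide for a pure classifier, here is a minimal sketch with made-up outcomes and predictions. The vectors `y`, `hard_pred`, and `prob_pred` are illustrative and not taken from the breast cancer data, and the 0.5 threshold in the 0-1 loss is an assumption. When every prediction is 0 or 1, each error is 0 or 1, so squaring it or taking its absolute value changes nothing.

```r
# Hypothetical outcomes (1 = malignant) and predictions, for illustration only
y         <- c(1, 0, 0, 1, 1, 0)
hard_pred <- c(1, 0, 1, 1, 0, 0)              # classifier output: 0 or 1 only
prob_pred <- c(0.9, 0.2, 0.6, 0.7, 0.4, 0.1)  # probability output

# Three loss functions; the 0-1 loss thresholds predictions at 0.5 (an assumption)
mspe     <- function(y, p) mean((y - p)^2)
abs_loss <- function(y, p) mean(abs(y - p))
zero_one <- function(y, p) mean(y != as.numeric(p > 0.5))

hard_losses <- c(mspe(y, hard_pred), abs_loss(y, hard_pred), zero_one(y, hard_pred))
prob_losses <- c(mspe(y, prob_pred), abs_loss(y, prob_pred), zero_one(y, prob_pred))

hard_losses  # all three values are identical
prob_losses  # the three values differ
```

With probability predictions the three losses carry different information, which is one reason to prefer a model that outputs probabilities for this exercise.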
Use the training and validation data any way you want, but do not use the test data until you’ve selected one final model. No cheating and no going back to fiddle with the model after you’ve seen the test data!
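One possible way to organize this workflow is to fit each candidate model on the training data only, score each on the validation data, and then commit to one. The sketch below uses simulated data and hypothetical names (`sim_train`, `sim_validation`, predictors `x1` and `x2`) rather than the breast cancer files, and it compares three logistic regressions by validation MSPE. Other candidate models and other selection criteria would work just as well.

```r
set.seed(1)

# Simulated stand-in data: x1 is informative, x2 is pure noise
simulate_data <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  y  <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x1))
  data.frame(y = y, x1 = x1, x2 = x2)
}
sim_train      <- simulate_data(300)
sim_validation <- simulate_data(135)

# Fit candidate models using the training data only
candidates <- list(
  intercept_only = glm(y ~ 1,       family = binomial, data = sim_train),
  x1_only        = glm(y ~ x1,      family = binomial, data = sim_train),
  x1_and_x2      = glm(y ~ x1 + x2, family = binomial, data = sim_train)
)

# Score each candidate on the validation data
validation_mspe <- sapply(candidates, function(fit) {
  p <- predict(fit, newdata = sim_validation, type = "response")
  mean((sim_validation$y - p)^2)
})

validation_mspe
best_model <- names(which.min(validation_mspe))
best_model
```

Only the model named in `best_model` would then be evaluated on the test data, and only once.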
Evaluate your one model on the test data (‘breast_dx_test.csv’) and report your performance metrics here:
library(tidyverse) # for read_csv
library(MASS) # stepAIC
library(pROC) # pROC
library(logistf) # logistf
library(glmnet) # glmnet
library(glmnetUtils) # formula interface for glmnet

# 300 randomly selected observations for training
breast_dx_train <- read_csv("https://raw.githubusercontent.com/psboonstra/prediction-lecture/refs/heads/main/breast_dx_train.csv")
# 135 randomly selected observations for validation
breast_dx_validation <- read_csv("https://raw.githubusercontent.com/psboonstra/prediction-lecture/refs/heads/main/breast_dx_validation.csv")
# 134 remaining observations for testing
breast_dx_test <- read_csv("https://raw.githubusercontent.com/psboonstra/prediction-lecture/refs/heads/main/breast_dx_test.csv")

# Build your model...

# Get predictions from your model with:
test_predictions <- predict(my_model, newdata = breast_dx_test, type = "response")

# Get MSPE, Absolute, 0-1 loss, and deviance using code from lecture

# Get AUC
roc(response = breast_dx_test$malignant, predictor = test_predictions)
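The lecture code for the loss functions is not reproduced here, so the following is only a sketch of what such a helper might look like. The function name `prediction_metrics`, the 0.5 threshold, the clipping constant `eps`, and the definition of deviance as minus twice the summed Bernoulli log-likelihood are all assumptions. The lecture's own definitions may differ, for example by averaging the deviance instead of summing it.

```r
# Hypothetical helper; the lecture's own code may define these differently
prediction_metrics <- function(y, p, threshold = 0.5, eps = 1e-12) {
  p_clipped <- pmin(pmax(p, eps), 1 - eps)  # avoid log(0) in the deviance
  c(
    mspe     = mean((y - p)^2),
    absolute = mean(abs(y - p)),
    zero_one = mean(y != as.numeric(p > threshold)),
    deviance = -2 * sum(y * log(p_clipped) + (1 - y) * log(1 - p_clipped))
  )
}

# Toy check with made-up values (not the breast cancer data)
y_toy <- c(1, 0, 1, 0)
p_toy <- c(0.8, 0.3, 0.6, 0.1)
toy_metrics <- prediction_metrics(y_toy, p_toy)
toy_metrics
```

On the exercise data this could be called as `prediction_metrics(breast_dx_test$malignant, test_predictions)`, assuming `malignant` is coded 0/1.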
Results
References
Mangasarian, O.L., Street, W.N. and Wolberg, W.H., 1995. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), pp.570-577.
Street, W.N., Wolberg, W.H. and Mangasarian, O.L., 1993, July. Nuclear feature extraction for breast tumor diagnosis. In Biomedical image processing and biomedical visualization (Vol. 1905, pp. 861-870). SPIE.